14 research outputs found

Understanding HTML with Large Language Models

Authors: Chowdhery Aakanksha
Faust Aleksandra
Fiedel Noah
Gur Izzeddin
Huang Austin
Miao Yingjie
Nachum Ofir
Narang Sharan
Safdari Mustafa
Publication date: 08/10/2022

Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks. Yet, their capabilities for HTML understanding -- i.e., parsing the raw HTML of a webpage, with applications to automation of web-based tasks, crawling, and browser-assisted retrieval -- have not been fully explored. We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks: (i) Semantic Classification of HTML elements, (ii) Description Generation for HTML inputs, and (iii) Autonomous Web Navigation of HTML pages. While previous work has developed dedicated architectures and training procedures for HTML understanding, we show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks. For instance, fine-tuned LLMs are 12% more accurate at semantic classification compared to models trained exclusively on the task dataset. Moreover, when fine-tuned on data from the MiniWoB benchmark, LLMs successfully complete 50% more tasks using 192x less data compared to the previous best supervised model. Out of the LLMs we evaluate, we show evidence that T5-based models are ideal due to their bidirectional encoder-decoder architecture. To promote further research on LLMs for HTML understanding, we create and open-source a large-scale HTML dataset distilled and auto-labeled from CommonCrawl.

arXiv.org e-Print Archive

Large Language Models Encode Clinical Knowledge

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLM models for clinical applications.

arXiv.org e-Print Archive

Towards Generalist Biomedical AI

Medicine is inherently multimodal, with rich data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence (AI) systems that flexibly encode, integrate, and interpret this data at scale can potentially enable impactful applications ranging from scientific discovery to care delivery. To enable the development of these models, we first curate MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduce Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system. Med-PaLM M is a large multimodal generative model that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. Med-PaLM M reaches performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. We also report examples of zero-shot generalization to novel medical concepts and tasks, positive transfer learning across tasks, and emergent zero-shot medical reasoning. To further probe the capabilities and limitations of Med-PaLM M, we conduct a radiologist evaluation of model-generated (and human) chest X-ray reports and observe encouraging performance across model scales. In a side-by-side ranking on 246 retrospective chest X-rays, clinicians express a pairwise preference for Med-PaLM M reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility. While considerable work is needed to validate these models in real-world use cases, our results represent a milestone towards the development of generalist biomedical AI systems.

arXiv.org e-Print Archive

PaLM: Scaling Language Modeling with Pathways

Authors: Agrawal Shivani
Austin Jacob
Barham Paul
Barnes Parker
Bosma Maarten
Bradbury James
Catasta Michele
Child Rewon
Chowdhery Aakanksha
Chung Hyung Won
Dai Andrew M.
Dean Jeff
Dev Sunipa
Devlin Jacob
Diaz Mark
Dohan David
Du Nan
Duke Toju
Eck Douglas
Fedus Liam
Fiedel Noah
Firat Orhan
Garcia Xavier
Gehrmann Sebastian
Ghemawat Sanjay
Gur-Ari Guy
Hutchinson Ben
Ippolito Daphne
Isard Michael
Lee Katherine
Levskaya Anselm
Lewkowycz Aitor
Lim Hyeontaek
Luan David
Maynez Joshua
Meier-Hellstern Kathy
Michalewski Henryk
Mishra Gaurav
Misra Vedant
Moreira Erica
Narang Sharan
Omernick Mark
Pellat Marie
Petrov Slav
Pillai Thanumalayan Sankaranarayana
Polozov Oleksandr
Pope Reiner
Prabhakaran Vinodkumar
Rao Abhishek
Reif Emily
Roberts Adam
Robinson Kevin
Saeta Brennan
Schuh Parker
Sepassi Ryan
Shazeer Noam
Shi Kensen
Spiridonov Alexander
Sutton Charles
Tay Yi
Tsvyashchenko Sasha
Wang Xuezhi
Wei Jason
Yin Pengcheng
Zhou Denny
Zhou Zongwei
Zoph Barret
Publication date: 19/04/2022

Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.

arXiv.org e-Print Archive

PaLM 2 Technical Report

Authors: Abrego Gustavo Hernandez
Ahn Junwhan
Anil Rohan
Austin Jacob
Bailey Paige
Barham Paul
Botha Jan
Bradbury James
Brahma Siddhartha
Brooks Kevin
Catasta Michele
Chen Zhifeng
Cheng Yong
Cherry Colin
Choquette-Choo Christopher A.
Chowdhery Aakanksha
Chu Eric
Clark Jonathan H.
Crepy Clément
Dai Andrew M.
Dave Shachi
Dehghani Mostafa
Dev Sunipa
Devlin Jacob
Du Nan
Dyer Ethan
Díaz Mark
Feinberg Vlad
Feng Fangxiaoyu
Fienber Vlad
Firat Orhan
Freitag Markus
Garcia Xavier
Gehrmann Sebastian
Gonzalez Lucas
Gur-Ari Guy
Hand Steven
Hashemi Hadi
Hou Le
Howland Joshua
Hu Andrea
Huang Yanping
Hui Jeffrey
Hurwitz Jeremy
Isard Michael
Ittycheriah Abe
Jagielski Matthew
Jia Wenhao
Johnson Melvin
Kenealy Kathleen
Krikun Maxim
Kudugunta Sneha
Lan Chang
Lee Benjamin
Lee Katherine
Lepikhin Dmitry
Li Eric
Li Jian
Li Music
Li Wei
Li YaGuang
Lim Hyeontaek
Lin Hanzhao
Liu Frederick
Liu Zhongtao
Maggioni Marcello
Mahendru Aroma
Maynez Joshua
Meier-Hellstern Kathy
Mishra Gaurav
Misra Vedant
Moreira Erica
Moussalem Maysam
Nado Zachary
Nham John
Ni Eric
Nystrom Andrew
Omernick Mark
Parrish Alicia
Passos Alexandre
Pellat Marie
Petrov Slav
Polacek Martin
Polozov Alex
Pope Reiner
Qiao Siyuan
Reif Emily
Richter Bryan
Riley Parker
Robinson Kevin
Ros Alex Castro
Roy Aurko
Ruder Sebastian
Saeta Brennan
Samuel Rajkumar
Shafey Laurent El
Shakeri Siamak
Shelby Renee
Slone Ambrose
Smilkov Daniel
So David R.
Sohn Daniel
Taropa Emanuel
Tay Yi
Tokumine Simon
Valter Dasha
Vasudevan Vijay
Vodrahalli Kiran
Wang Pidong
Wang Tao
Wang Xuezhi
Wang Zirui
Wieting John
Wu Yonghui
Wu Yuhuai
Xiao Kefan
Xu Kelvin
Xu Yuanzhong
Xu Yunhan
Xue Linting
Yin Pengcheng
Yu Jiahui
Zhang Qiao
Zhang Yujing
Zheng Ce
Zheng Steven
Zhou Denny
Zhou Weikang
Publication date: 13/09/2023

We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities. When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.

arXiv.org e-Print Archive

Aerial Channel Prediction and User Scheduling in Mobile Drone Hotspots

Authors: Aakanksha Chowdhery
Kyle Jamieson
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref